Classification of Colon Cancer

Received 24 Jan. 2004

Background 

The microsatellite instability status of colon cancers is a better marker for response to adjuvant
chemotherapy with fluorouracil than tumour stage II and III. We investigated the possibility of
classifying colon tumors based on gene expression and correlated the resulting classes with crude
survival.



Methods 

Gene transcripts from tumour specimens were quantified using microarray technology. The tumors 
were clustered using unsupervised and supervised classification algorithms. 



Results 

Unsupervised hierarchical clustering revealed that tumors were essentially separated according to
microsatellite instability status. Supervised classification of the 97 tumor samples using a maximum
likelihood classifier with a crossvalidation loop resulted in three misclassifications as compared to
microsatellite analysis, using from 106 genes down to only seven genes. The stability of
classification of colon tumors in relation to microsatellite status was tested by permutation analysis.
The sensitivity for diagnosis of microsatellite stable tumors exceeded 99% with a specificity
exceeding 96%. The positive and negative predictive values exceeded 95% and 98%, respectively.
The classifier was demonstrated not to be platform dependent, as it could successfully be reproduced
by real-time PCR. Crude survival according to microsatellite status, as determined by the classifier,
revealed that among stage II colon cancer patients receiving no adjuvant chemotherapy, patients
displaying microsatellite instability had significantly longer overall survival than patients
exhibiting microsatellite stable tumors (P=0.0014).
By contrast, patients with Dukes' C tumors displaying microsatellite instability did not have a
significant increase in overall survival as compared to patients exhibiting microsatellite stable
tumors (P=0.55).

Conclusion 

Colon cancer can be stratified into two molecularly distinct groups by quantification of the
transcripts of 106 genes or even down to seven genes. The two groups are highly correlated with
microsatellite stable (MSS) and microsatellite instable (MSI) tumors. The 7-gene classifier clearly
proved to be a strong predictor of survival in Dukes' B and it can be used to select patients who need
adjuvant chemotherapy, namely those classified as MSS. We demonstrate that this classification is also
valid when performed by real-time PCR analysis, allowing a fast diagnosis in a clinical setting.

Colon cancer is the fourth most frequently diagnosed malignancy and the second most common cause of
cancer death in the western world. The standard treatment of colon cancer is advised according to
tumor stage. Patients with Dukes' C colon cancer receive fluorouracil-based adjuvant systemic
chemotherapy in addition to surgical resection of the tumor, whereas the treatment for Dukes' B
patients is based on surgical resection alone.

There is accumulating evidence that these cancers belong to two distinct molecular types according
to genetic alterations: the mutator phenotype, featuring tumors with microsatellite instability (MSI),
and the suppressor pathway, displaying chromosomal instability and microsatellite stability (MSS).
MSI has been defined as a change of any length due to either insertions or deletions of repeating
units in a microsatellite within a tumor compared to normal tissue and is caused by an underlying
defect in the mismatch repair (MMR) system (Boland et al., Cancer Res 1998, 58:5248). The MSI pathway
may either be sporadic or hereditary (HNPCC). Whereas the disruption of the MMR system in
sporadic MSI tumors is most often caused by somatic methylation of the MLH1 gene promoter,
more than 90% of HNPCC cancers are caused by germline mutations in MLH1 or MSH2.
The MSS pathway to cancer begins with the inactivation of tumor suppressor genes, such as
APC/β-catenin genes, followed by activation of oncogenes and inactivation of additional tumor
suppressor genes, commonly with a high frequency of allelic losses, cytogenetic abnormalities
and abnormal tumor DNA content. Many studies have defined the pathoclinical traits of MSI and
MSS tumors and found that MSI positive cancers are most frequently found in the right side of the
colon, tend to be less differentiated, tend to be larger in size, are often mucinous and
often exhibit extensive infiltration by lymphocytes.

Crude survival data suggest that patients with HNPCC have a better prognosis than those with
sporadic disease [48,49,50] and studies have also shown that MSI is an independent indicator of
good prognosis [35,52,53]. Recently it was shown that patients with MSS tumors benefit from
5-FU/leucovorin treatment (New England J Med, August 2nd, 2003), whereas MSI cancer patients gained
no advantage in survival.

Gene expression profiling has become an increasingly used method for classification, outcome
prediction and prediction of response (for a review see Dyrskjot, Expert Opinion). Most such studies
dealing with colon cancer have dealt with the identification of general tumor markers (Alon U; Levine
AJ, PNAS (1999); Kitahara O; Tsunoda T, Cancer Res (2001); Notterman DA; Levine AJ, Cancer Res
(2001); Yanagawa R; Nakamura (2001); Zou TT; Meltzer SJ, Oncogene (2002); Demtroeder, Cancer Res
(2002)), markers for benign adenomas versus adenocarcinomas (Lin YM; Nakamura Y, Oncogene (2002);
Williams NS; Becerra C, Clin Cancer Res (2003)), staging (Frederiksen CM; Orntoft TF, J Cancer Res
Clin Oncol (2003)), or liver metastasis (Takemasa I; Matsubara K, Biochem Biophys Res Commun (2001);
Yanagawa R; Nakamura, Neoplasia (2001); Agrawal D; Yeatman T, J Natl Cancer Inst (2002)). One study
has addressed the separation of low-frequency microsatellite instability tumors (MSI-L) from MSI
and MSS (PCA) (Mori Y; Meltzer SJ, Cancer Res (2003)).

The aim of this study was to build a generally applicable and robust classifier based on gene
expression to separate MSS from MSI tumors. To achieve such robustness the tumors for this study
were collected from 14-16 different clinics, RNA was isolated using different methods and labelled
in several batches. Gene expression was measured by DNA microarrays in 101 Danish and Finnish
tumors from primary colon cancer patients along with 17 normal biopsies.

Methods 

Biological material. From the Danish and Finnish CRC tissue banks, 101 primary colon cancers and
17 macroscopically normal colon epithelium samples from the oral resection edge were chosen.
Only adenocarcinomas of Dukes' stage B and C were included; however, these represented a
broad spectrum of tumors in relation to location, heredity, microsatellite instability status, and
origin of the patient. All tumors were collected in the period from 1994 to 2002: 68 tumor samples
were collected at nine different clinics in Finland and 33 samples were collected at four different
clinics in Denmark. Of these, 36 were Dukes' B and 67 Dukes' C; 41 were sporadic microsatellite
highly instable (MSI-H), of which 17 were HNPCC, and 59 were sporadic microsatellite stable (MSS)
(Table 1). None of the patients received pre-operative radiation or chemotherapy.

Microsatellite-instability analysis. From all tumor samples available as paraffin blocks, ten
sections were cut at 10 µm and stained with haematoxylin. The first and last section was cut at 4 µm
and stained with haematoxylin. These two sections were used for the identification of tumor and
normal cells from each sample. Regions enriched in tumor cells (more than 90%) were
microdissected from these sections and DNA was extracted using a Puregene DNA extraction kit
(Gentra Systems, Minneapolis, MN). DNA from blood samples was used as control when available;
otherwise normal tissue was microdissected from the tissue sections. The samples were analysed for
microsatellite instability according to the NCI guidelines (Boland et al). Samples positive for
markers BAT25 and BAT26 were scored as MSI-H. Samples positive for only one of these markers
were tested for further markers and scored as MSI-L if none of these tested positive. Since MSI-L
has similar clinical features to MSS, these samples were considered as MSS in this study. In
addition to microsatellite analysis, all tumors from which paraffin blocks were available were tested
for the presence of MLH1 and MSH2 protein by immunohistochemistry. None of the samples
scored MSS were negative for either protein, whereas six of the MSI scored samples were positive
for both (Table 1).

RNA purification. Colon specimens were obtained fresh from surgery and were immediately snap
frozen in liquid nitrogen either as they were, in OCT-compound or in an SDS/guanidinium thiocyanate
solution. Total RNA was isolated using RNAzol (WAK-Chemie Medical) or spin column
technology (Sigma) following the manufacturers' instructions.

Gene expression analysis. These procedures were performed as described in detail elsewhere
(Dyrskjot et al). Briefly, ten µg of total RNA was used as starting material for the target preparation
as described. First and second strand cDNA synthesis was performed using the Superscript II
System (Invitrogen) according to the manufacturer's instructions except using an oligo-dT primer
containing a T7 RNA polymerase promoter site. Labelled cRNA was prepared using the BioArray
High Yield RNA Transcript Labelling Kit (Enzo) using biotin labelled CTP and UTP (Enzo) in the
reaction together with unlabeled NTPs. Unincorporated nucleotides were removed using RNeasy
columns (Qiagen). Fifteen µg of cRNA was fragmented, loaded onto the Affymetrix HG-U133A
probe array cartridge and hybridized for 16 h. The arrays were washed and stained in the Affymetrix
Fluidics Station and scanned using a confocal laser-scanning microscope (Hewlett Packard
GeneArray Scanner G2500A). The readings from the quantitative scanning were analyzed by the
Affymetrix Gene Expression Analysis Software (MAS 5.0) and normalized using RMA (robust
multi array normalisation, Irizarry et al. 2002) in the statistical application R. Redundant probesets
(as defined from Unigene build 168) with high correlation (>0.5) over all samples were removed,
which reduced the dataset to approximately 14,400 probesets. This dataset was used as the source for
all further calculations in this manuscript.

Unsupervised agglomerative hierarchical clustering

For hierarchical cluster analysis 1239 genes with a variation across all samples greater than 0.5
were median-centred to a magnitude of 1. Samples and genes were then clustered using average
linkage clustering with a modified Pearson correlation as similarity metric (Eisen et al., PNAS 95:
14863-14868, 1998). The cluster dendrogram was visualized with TreeView (Eisen).
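
The clustering step can be illustrated with a minimal sketch. It assumes plain average-linkage agglomeration on 1 minus the ordinary Pearson correlation, which is a simplification of the modified correlation used by the Eisen software; the function names and the toy profiles are hypothetical and not taken from the study.

```python
from itertools import combinations
from statistics import mean


def pearson_distance(a, b):
    """Return 1 - Pearson correlation between two equally long profiles."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return 1.0 - cov / (var_a * var_b) ** 0.5


def average_linkage(profiles):
    """Agglomerate profiles; return the merges as (cluster_a, cluster_b) tuples."""

    def linkage(pair):
        # Average pairwise distance between the members of two clusters.
        a, b = pair
        return mean(pearson_distance(profiles[i], profiles[j]) for i in a for j in b)

    clusters = [(i,) for i in range(len(profiles))]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=linkage)
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((a, b))
    return merges


# Toy expression profiles (rows = samples, columns = genes); values are made up.
samples = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 2.9, 4.2],
    [4.0, 3.0, 2.0, 1.0],
    [4.2, 2.9, 2.1, 0.9],
]
merges = average_linkage(samples)
```

The order of `merges` corresponds to reading a dendrogram from the leaves upwards: the most similar profiles are joined first and the last merge joins the two main branches.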

Group testing 

We made a statistical test in which the p-value is evaluated through permutations. For each group and
gene we calculate the average and the sum of squared deviations from the average. We then sum
these over the genes and the groups:

$$S_1 = \sum_{g \in \mathrm{groups}} \sum_{j \in \mathrm{genes}} \sum_{i \in g} \left( x_{ij} - \bar{x}_{gj} \right)^2$$

where $x_{ij}$ is the expression of gene j in sample i and $\bar{x}_{gj}$ is the average of gene j in
group g. For the four groups this sum is denoted S1. The same expression is calculated after joining
DK with SF, or MSI with MSS, such that we end up with two groups; this sum of squared deviations is
denoted S2. As a test statistic we use S1/S2. A small value indicates that there is a real reduction in
the deviations when going from 2 to 4 groups and thus the groups have a real significance. To judge
if a value is significantly small we use permutations. For each of the groups left when joining DK
and SF we randomly allocate the members to a pseudo DK and a pseudo SF in such a way that the number
of members in each group is as in the original data.

To get an understanding of this separation we performed a test to see if it is caused by few genes
or if many genes are involved. For this test we wrote S1 as the sum over genes of S1(gene), and
similarly S2 as the sum over genes of S2(gene). For each gene j we used the test statistic
S1(j)/S2(j) (Table 3).
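
The permutation scheme described above can be sketched as follows. This is a simplified illustration under stated assumptions: it splits one joined group into two, rather than going from two groups to four, and the function names, the seed, and the toy data are hypothetical.

```python
import math
import random
from statistics import mean


def within_ss(groups):
    """Sum over groups and genes of squared deviations from the group mean.

    `groups` is a list of groups, each group a list of samples,
    each sample a list of expression values (one per gene).
    """
    terms = []
    for samples in groups:
        for gene_values in zip(*samples):          # iterate gene by gene
            m = mean(gene_values)
            terms.extend((v - m) ** 2 for v in gene_values)
    return math.fsum(terms)                        # order-independent summation


def permutation_test(group_a, group_b, n_perm=100, seed=0):
    """Return (S1/S2 from the data, number of permutations giving a smaller ratio)."""
    rng = random.Random(seed)
    joined = group_a + group_b
    s2 = within_ss([joined])                       # deviations with the groups joined
    observed = within_ss([group_a, group_b]) / s2  # S1/S2 for the real grouping
    smaller = 0
    for _ in range(n_perm):
        shuffled = joined[:]
        rng.shuffle(shuffled)
        pseudo_a = shuffled[:len(group_a)]         # same group sizes as in the data
        pseudo_b = shuffled[len(group_a):]
        if within_ss([pseudo_a, pseudo_b]) / s2 < observed:
            smaller += 1
    return observed, smaller


# Toy data: two well separated groups of four samples with two genes each.
group_a = [[0.0, 1.0], [0.2, 0.8], [0.1, 1.1], [-0.1, 0.9]]
group_b = [[5.0, 6.0], [5.2, 5.9], [4.9, 6.1], [5.1, 6.2]]
observed, smaller = permutation_test(group_a, group_b)
```

A count of zero smaller values among the permutations corresponds to the situation reported in Table 2, where the ratio from the data lies below every permuted ratio.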

Multidimensional Scaling

We carried out multidimensional scaling on median-centered and normalized data using cmdscale
in the statistical application R and visualized the result in a two-dimensional plot.
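
For readers without R at hand, classical multidimensional scaling, which is the method cmdscale implements, can be sketched with NumPy. The function name `classical_mds` and the toy distance matrix are assumptions made for illustration only.

```python
import numpy as np


def classical_mds(distances, n_components=2):
    """Classical (Torgerson) MDS: embed a symmetric distance matrix in n_components dimensions."""
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering   # double-centred Gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components] # keep the largest ones
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


# Toy example: three points on a line at positions 0, 3 and 5 (made-up data).
toy = [[0, 3, 5], [3, 0, 2], [5, 2, 0]]
coords = classical_mds(toy)
```

Each row of `coords` is one sample in the two-dimensional plot; Euclidean distances between rows approximate the input distances.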

Microsatellite status classifier 

The microsatellite instability status classifier was based on a dataset of 4,266 genes. These genes
result from the removal of genes with a variance over all tumor samples smaller than 0.2 and of genes
that separate Danish from Finnish samples with a t-value numerically greater than 2. We used a
normal distribution with the mean dependent on the gene and the group (MSI, MSS). For each gene,
we calculated the variation between the groups and the variation within the groups to select genes
with a high ratio between these. To classify a sample, we calculated the sum over the genes of the
squared distance from the sample value to the group mean, standardized by the variance, and
assigned the sample to the nearest group. The sample to be classified was excluded when
calculating group means and variances.
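
The classification rule above can be illustrated with a small leave-one-out sketch. It assumes a pooled within-group variance per gene and non-zero within-group variation for every gene; all names and the toy data are hypothetical and the code is not the authors' implementation.

```python
from statistics import mean


def train(samples, labels, n_genes):
    """Rank genes by between/within variation; keep means and variance for the best ones."""
    groups = sorted(set(labels))
    stats = []
    for j in range(len(samples[0])):
        values = {g: [s[j] for s, l in zip(samples, labels) if l == g] for g in groups}
        means = {g: mean(v) for g, v in values.items()}
        grand = mean(s[j] for s in samples)
        within = sum((x - means[g]) ** 2 for g, v in values.items() for x in v)
        between = sum(len(v) * (means[g] - grand) ** 2 for g, v in values.items())
        variance = within / (len(samples) - len(groups))   # pooled within-group variance
        stats.append((between / within, j, means, variance))
    return sorted(stats, key=lambda t: t[0], reverse=True)[:n_genes]


def classify(model, sample):
    """Assign the sample to the group with the smallest standardized squared distance."""

    def distance(group):
        return sum((sample[j] - means[group]) ** 2 / var for _, j, means, var in model)

    return min(model[0][2].keys(), key=distance)


def leave_one_out_errors(samples, labels, n_genes):
    """Count misclassifications when each sample is excluded from training in turn."""
    errors = 0
    for i in range(len(samples)):
        model = train(samples[:i] + samples[i + 1:], labels[:i] + labels[i + 1:], n_genes)
        errors += int(classify(model, samples[i]) != labels[i])
    return errors


# Toy data: gene 0 separates the groups, genes 1 and 2 are noise (made-up values).
samples = [
    [1.0, 5.0, 2.0], [1.2, 4.0, 2.5], [0.9, 6.0, 1.8], [1.1, 5.5, 2.2],
    [3.0, 5.2, 2.1], [3.2, 4.1, 2.4], [2.9, 5.9, 1.9], [3.1, 4.8, 2.3],
]
labels = ["MSI"] * 4 + ["MSS"] * 4
errors = leave_one_out_errors(samples, labels, n_genes=1)
```

Because the gene ranking is recomputed inside every leave-one-out round, the left-out sample never influences which genes are selected.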

Estimation of classifier stability

We validated the performance of the classifier by permutation. One hundred datasets consisting of
30 MSS samples and 25 MSI samples were randomly chosen by permutation for training of the
classifier, with the remaining samples in each case being assigned to a test set. Averages over the 100
data sets of the number of errors in the cross-validation of the training set and in the test set were
used as a measure of the precision of the classifier.
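
The resampling scheme can be sketched as a generator of training and test index sets. The function name, the seed, and the label vector are assumptions for illustration; the class sizes are chosen to match the training and test set sizes listed in Table 6.

```python
import random


def random_splits(labels, n_train, n_splits=100, seed=0):
    """Yield (train_indices, test_indices) with a fixed number of training samples per class.

    `n_train` maps each class label to the number of training samples to draw.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    for _ in range(n_splits):
        train = []
        for label, indices in by_class.items():
            train.extend(rng.sample(indices, n_train[label]))
        chosen = set(train)
        test = [i for i in range(len(labels)) if i not in chosen]
        yield sorted(train), test


# Hypothetical label vector: 59 MSS and 35 MSI samples.
labels = ["MSS"] * 59 + ["MSI"] * 35
splits = list(random_splits(labels, {"MSS": 30, "MSI": 25}))
```

Each split would then be used to train a classifier on the training indices and count errors on the test indices, with the counts averaged over all splits.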

Real-time PCR (RT-PCR). The procedures were as described (Birkenkamp-Demtroder) except
that we used short LNA (Locked Nucleic Acid) enhanced probes from a Human Probe Library
(Exiqon™). In short, cDNA was synthesized from single samples, some of which were previously
analyzed on GeneChips. Reverse transcription was performed using Superscript II RT (Invitrogen).
Real-time PCR analysis was performed on selected genes using the primers (DNA Technology) and
probes (Exiqon, DK) described in figure legend X. All samples were normalized to GAPDH as
described previously (Birkenkamp-Demtroder et al., Cancer Res., 62: 4352-4363, 2002).

Rebuilding of Classifier based on Real-Time PCR 

The 79 tumor samples that were not analysed by real-time PCR were transformed into log ratios
using one of the tumor samples as reference and used for training of the classifier. Then 23 samples,
of which 18 were also analyzed on arrays, were likewise transformed into log ratios using the same
tumor sample as reference and tested.
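
The log-ratio transformation can be sketched as below. The base of the logarithm is not stated in the text, so base 2 is an assumption here, and the function name and the numbers are hypothetical.

```python
import math


def to_log_ratios(samples, reference):
    """Convert expression values to log2 ratios against a reference sample, gene by gene."""
    return [
        [math.log2(value / ref) for value, ref in zip(sample, reference)]
        for sample in samples
    ]


# Made-up expression values for two genes; the reference is one chosen tumor sample.
reference = [100.0, 50.0]
ratios = to_log_ratios([[200.0, 50.0], [400.0, 25.0]], reference)
```

Expressing both the array data and the PCR data relative to the same reference sample puts the two platforms on a comparable scale before training and testing.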

Results

Hierarchical Clustering

The clinical specimens used in this study were collected in two different countries from 14 different
clinics in the period 1994 to 2001. The samples were selected to keep a balanced representation of
microsatellite instable (MSI) and microsatellite stable (MSS) tumors from both the right- and
left-sided colon. The MSI class was represented both by sporadic MSI and hereditary MSI (HNPCC)
tumors. Only Dukes' B and Dukes' C tumor samples were included (Table 1). Before
any attempt to divide a diverse sample collection into distinct classes, we analyzed the data for
systematic bias that may have been introduced during the experimental procedures. A fast and easy
way to discover both true distinct classes and systematic biases in the data is to perform a
hierarchical clustering.

The phylogenetic tree resulting from hierarchical clustering on 1239 genes (Fig. 1) reveals that the
main separating factor is microsatellite status. On the upper trunk we find two clusters represented
mainly by normal biopsies (14/21) and MSS tumors (18/25), respectively. The lower trunk is
divided into an MSI cluster (30/36) and a second MSS cluster (MSS2-cluster) (34/37). A closer
inspection of the two MSS clusters reveals that one is dominated by Danish samples (19/25) and one
by Finnish samples (26/37). Also, it is worth noting that the MSI cluster contains a vast
majority of Finnish samples (32/36) and that the sporadic MSI samples are interspersed among the
hereditary samples. The normal biopsies cluster tightly together with a slight tendency to separation
according to origin. Three normal samples cluster within the MSI cluster, indicating that resection of
these samples may have been too close to the tumor lesion.

Inspection of the gene cluster dendrogram shows that the two groups of MSS tumors are mainly
separated by a large cluster of genes being upregulated in the Danish samples (data not shown),
indicating a systematic difference between Danish and Finnish samples.

Significance of Observed Groups

Based on these observations, we performed a series of tests to evaluate if the observed separation of
tumors into MSS and MSI as well as DK and SF is significant. For these tests the tumor samples
were grouped into four virtual tumor groups, i.e. Danish MSI (MSI-DK), Danish MSS
(MSS-DK), Finnish MSI (MSI-SF) and Finnish MSS (MSS-SF). Based on 5082 genes with a
variance above 0.2, we tested if all four groups are significant or if some of the groups can be
joined. We considered the two possibilities of joining DK and SF, and of joining MSI and MSS, and
made a statistical test where the p-value is evaluated through permutations. For each group
combination our test value S1/S2 is considerably smaller than the values found in all 100
permutations (Table 2), demonstrating a very clear separation between DK and SF and between MSI and
MSS. Such a clear distinction between groups may rely on a few highly separating genes or on a
general difference in the gene expression profile including many genes. For both DK-SF and MSI-MSS
the effect is caused by many genes even at very strict criteria, i.e. low values of the test statistic
S1(j)/S2(j) (Table 3).
When a property is present that influences a large proportion of the genes, this may obscure
separation of clinically relevant features in unsupervised clustering. To visualize the effect of such
properties, we calculated distances between samples by multidimensional scaling with and without
removal of the 816 genes separating DK from SF with a t-value numerically greater than 2 (Fig. 2). We
see an improved separation of MSI and MSS with Danish and Finnish cases mixed. The MSI-DK samples
are not completely separated, as they are found both between the MSI-SF and the MSS samples.
(These plots are not entirely unsupervised since the groups have been used to remove genes.)

Construction of an MSI-MSS classifier

For the construction of a classifier we used the expression profiles from 97 tumors for which no
ambiguity had been identified in relation to microsatellite status. The 816 genes separating DK from
SF were excluded, as these would be unreliable for MS classification. We built a maximum
likelihood classifier in order to select a minimum of genes giving the largest possible separation of
the two groups. We tested the performance of the classifier using 1-1000 genes and found that it
was stable, showing 3-6 errors when using 4-400 genes. Of these, 106 genes were especially suited
for discrimination of MSS from MSI (Table 4). The minimum of three errors was found even when using
only 7 genes (Table 5).

Classification of ambiguous samples 

Application of the 7-gene classifier to the four samples showing ambiguity in the microsatellite
analyses assigns all four to the microsatellite stable tumor class. Notably, all four showed expression
levels of Transforming Growth Factor β induced protein (TGFBI), MLH1 and thymidylate synthase
(TYMS) that are atypical for MSI tumors. Furthermore, these tumors were all from the left colon.
Thus these tumors are either truly MSS or they belong to a yet undefined class of MSI
tumors.

Stability of classification 

To estimate the stability of the classifier based on all 97 tumor samples, we generated one hundred
new classifiers based on randomly chosen datasets consisting of 30 MSS and 25 MSI samples. In
each case the classifiers were tested with the remaining samples. The performance for each set was
evaluated and averaged over all 100 training and test sets (Table 6). The mean error rate was 0.52%
for MSS tumors and 1.38% for MSI tumors. The seven genes defined above were found to be
those genes that were most frequently used in the crossvalidation loop. More than 50% of the errors
were related to three tumors, of which two were wrongly classified in all permutations and one in
94%. The remaining errors were mainly caused by four tumors with error rates of 40-47%, showing
that the former three samples are consistently assigned contrary to the result of the microsatellite
analysis and that four samples could not be assigned with confidence to any of the classes.

Cross platform classification 

Real-time PCR was applied both to verify the array data and to examine if the 7-gene classifier would
also perform on this platform. We chose 23 samples, of which 18 were also analyzed on arrays. The
correlation between the two platforms was high (data not shown). In order to test the performance
of classification using PCR data we rebuilt our classifier with a 79-sample array dataset including
only those tumors that were not analyzed with PCR. Two samples were classified in discordance
with the microsatellite instability test, one of which was ambiguously classified by the 7-gene
array classifier.

Relation between microsatellite-instability status, stage and survival

When the 7-gene classifier was applied to 36 patients with Dukes' B tumors receiving no
adjuvant chemotherapy, 18 were classified as MSI tumors and 18 as MSS tumors. The overall
survival was highly significantly related to the classification, since all nine patients that died within
five years of follow-up belonged to the MSS group (P=0.0014) (Fig. 3A). Thus, the 7-gene
classifier clearly proved to be a strong predictor of survival in Dukes' B and it can be used to select
patients who need adjuvant chemotherapy, namely those classified as MSS.

Among 65 patients with Dukes' C tumors receiving adjuvant chemotherapy, 17 were classified as
MSI tumors and 48 as MSS tumors. Of these, 6 MSI and 27 MSS patients died within five years of
follow-up, meaning no significant difference in overall survival between these groups (P=0.55)
(Fig. 3B). A trend was that the MSI patients showed a poorer short-term survival than the MSS
patients, contrary to Dukes' B patients. This difference can be attributed to the fact that a recent
large study has shown that chemotherapy only benefits the MSS tumor patients, thus improving their
survival to a level comparable to that which is characteristic of MSI tumor patients.

Clinical application of the discovery 

In the clinic the 106 or fewer genes described can be used for predicting outcome of colorectal cancer
when examined at the RNA level and also at the protein level, as each gene identified in the project
is transcribed to RNA that is further translated into protein. The RNA determination can be made in
any form using any method that will quantify RNA. The proteins can be measured with any
quantification method that can determine levels of protein.

Figure Title.

Figure 1. Phylogenetic tree resulting from unsupervised hierarchical clustering.

Figure 2. Multidimensional scaling plot.

Figure 3. Kaplan-Meier estimates of overall survival.

Table 1. Summary of clinicopathological and microsatellite features of colon cancer samples
Table 2. Permutation test of groups
Table 3. Permutation test of genes
Table 4. Performance of the classifier

Table 5. Genes used for the classification of MSS vs MSI tumors

Figure 1. Cluster analysis of Colon Specimens with Associated Clinicopathological Features.

Figure 2. Multidimensional Analysis showing distances between groups of tumors.

Figure 3. Kaplan-Meier Estimates of Overall Survival among Patients with Dukes' B and
Dukes' C Colon Cancer According to the Microsatellite-Instability Status of the Tumor.

Table 2. Permutation test of groups

Pseudo group    S1/S2 from data    Smaller values in 100 permutations    Minimum in 100 permutations
DK-SF           0.9072795          0                                     0.962269
I-S             0.9166195          0                                     0.9583325



Table 3. Permutation test of genes

                                           S1(j)/S2(j)
Pseudo group                               < 0.6    < 0.7    < 0.8    < 0.9
DK-SF       number of genes                36       136      522      1785
            max in 100 permutations        0        0        2        225
MSI-MSS     number of genes                17       103      399      1507
            max in 100 permutations        0        1        8        250


AFFYiD 
1405 1 at 
200628 5 at 

200814 at 


SYMBOL 

SSL§ . 

WARS 


LOCUSUfJK 

£2--. 

7453 

5720 


Hp l 


REF5BQ 

NM_O08283, 
NM_006263 


" GENENAME 

oroteosome (prosome. macrooain) activator subumi 1 (PA26 alpha) 


201641 at 
201649 at 


BST2 
U8E2L6 


604 
9246 


603890 


NMJXMMS 
NM_Q04223, 
NM 004223 


« 

ublqumn-Gcnftjgatlng enzyme E2L 6 


201674 5 at 
C0i762_s_at 
20l884_at 


AKAP1 
PSME2 
CEACAM5 


6165 
5721 
1048 


602449 
60216.1 


NM 0034 Bd, 
NM 003489 

NM_CQ43g? 


A kinase (PRKA) a/ichor protein 1 _ _ 

oroleasome (orosome. macrooam) activator subunil 2 f PA28 beta) 

carcinoembrvonic anligon-related celt adhesion molecule 5 

ffrm. RhoGEF fARHGEF) and pleckstnn domain protein 1 tchondrocvto-donVed) 


901910 at 
201976 5 at 
202072 at 

202203 % at 
202262_x_at 
202510 .9 at 
20252D » at 
202589 at 
202637 » at 
202678_at 


W 

HNRPL 

AMFR 

□DAH2 

TNFAIP2 

MLH1 

TYMS 

ICAM1 

GTF2A2 


10160 

.4S&- 
3191 

267 
23564 
7127 
4292 
7298 
3383 
2956 


603083 

603243 
604744 
€03300 
120436 
183350 
147B40 
600519 


MM 005766 

NM 00.1*33 
NMJ001 144. 
NM_001 144 
NM 913974 
J4M, 006291 

NM 000201 


myosin X 

haterogBneous nuclear nbonucleoprotain L . 

autocrine motility factor receptor — 

dimethytargmina dirnethyJaminohydrojase 2 — 

mutt, homolool. colon cancer. nonpolvoosls type 2 (E coU) 

intercellular adhesion molecufe 1 (CDS4), human rhmovirua receptor 
general transcription factor HA. 2. 12k0a 


202762 at 

203315 at 
203335 at 
203444 s at 
203559 S at 
203773„x at 

_Ai«JPW_s_ai 
203915 at 
204020 at 
204044 at 


ROCK2 
APACD 

PHYH 
MTA2 
ABP1 
BLVRA 

PLCB4 
CXCL9 
PURA 
QPRT 


9475 
10190 

8440 
5264 
9219 
26 
644 

5332 
4283 
5813 
23475 


604002 

604930 
602026 
603947 
104610 
109750 

600810 

<?PJZ2* 

600473 
606248 


NM 005783 
NM 003581 
NM ,006214 
NM 004739 
NM 001091 
NM 000712 
NMJJ00933, 

NM 002418 
NM 005859 
NM 014288 


Rho-associated. cotled-coa containing protatn kinase 2 __, 

ATP binding orotein associated with cell differentiation _ « 

phytanoyt-CoA hydroxylase (Refsum disease) ___.__.___, — __ _— ~— — 

metastais-associated gene family, member 2 — 

amiloride binding protein 1 (amine oxidase (copper-containing)) , 

chemokine (C-X-C motif) hgand 9 _ , — 

punne-nch elomor.t binding protein A 

quinolinate phosphonbosyltransferase (nicotinate-nucleotide pyroohosphorvlase fcarboxylatmg)) 1 


204070 at 
204103 at 
204131 s at 

204326.x. at 

204415 at ! 


RARRGS3 
9.CL4 
POXQ3A 
MT1X 

G1P3 


5920 
6351 
2309 

rfspi 


, WW? 
182284 
6026B1 
156359 

147572 


NM 004565 

NM, 00,1455 
NM 005952 
NM_002038. 
NM 002038, 
NMJ>22673 


retinoic acid receptor resoonder (tazarotene induced) 3 M 

chemokine tC-C motif) lioand 4 — 

forkhead box Q3A — 

interferon, alpha-mducible protein (clone 1FI-6-16) 


204533 at 
2Q4745_x al 


CXCL10 
MT1G 


3627 
±J95 


147310 

^6353, 


NM 001565 
NMjDQ5950, 
NM_00$O50 


chemokine (C-X-C motiO ligandiO __ 

metallothionein 1G 


i OU a ol 

204858 s at 


TNFR9FR 

6CGF1 


355 
1890 


134637 
131222 


NMJJ0QO43, 
NM_O0O043, 
NM_1S2677, 
NM 152876, 
NM^I 52875, 
NM 152672, 
NM_1 52873. 
NM 152671 
NM ,001953 


tumor necrosis factor receptor suparfamily. member 6 

endothelial ceil growth factor 1 {platelet-denved) 


205241 at 

205242 at 

205495 s at 


SCQ2 
CXCL13 

GNLY 


9997 
10563 

1057.6 


604272 
605149 


NM.005135 
NM | O06419 
NM 006433, 
NM 006433 


SCO cytochrome oxidase deficient homolog 2 fveast) 

chemokine (C-X-C motif) ligand 13 (B^cell chemoauractant) 


205831 at 
206108_.8_at 
206285 s at 


CD2, 

6FRS5 

TDGF1 


914 
6431 
6997 


188990 
601944 
187395 


NM 00J767 
NM 906275 
NM 003212 


solictng factor, argmine/senne-rich 6 


20S461 x at 
206754. s_ at 


MT1H 
CYP2B6 


4496 
1555 


156354 
123930 


NM, O9595I 
NM 000767 


cytochrome P450. family 2. subfamily B. polypeptide 6 


206907 at 
206918 s at 


TNFSF9 
RBM12 


8744 
10137 


605182 
607179 


NM 003311 
NMJJ06047. 
NM_006047 


tumor necrosis factor {hgano!) superfamily, member 9 . 

RNA binding motif protein 12 


206976 s at 


HSPH1 


1080J 




NM 006644 




207320 x at 


STAU 


6780 


601716 


NM 004602, 
NM 004602, 
NMJ017452, 
NM_017453 






INM 021246 


lymphocyte antigen 6 complex, locus G6D 








calcium binding protein P22 _ 


208022 s at 
208156 x at 


C0C14B 


8555 
B3481 


603505 


NM_003671. 
NM 003671, 
NM_033331 


CDC14 cell division cycle 14 homolog B (S cerovisiae) 


208581 x at 
208944 at 


TGFBR2 ~ 
PBKCBP1 


7048 
23613 


158359 
190182 


NM 005952 
NM M 003 ( 242 
NMJD1240B. 
NM_01240S. 
NM_1 03O47 


metallothiongm 1X , 

transforming growth factor, bata racaotor It (7D/B0KDa) 

protein kinasa C bindino protein 1 


|ia'f:» i^V-asSt. J>>.COT— — ^TUTTIM li 1 1 I I 


]NM 003270 


transmembrane 4 superfamily member 6 








pleckstnn homology domain contatmna. family B (evectms) member 1 


209546 s at 


APOU 


8542 


603742 


NMJD03661, 
NM 003661, 
NmIi45343 




|2l0029„at 


INDO 


J 2S2S 


14743E 


> NM 002164 





210103 s at 


FQXA2 


3170 


600288 


NM_D217S4, 
NM 021784 


orkhead box A2 - — — _ 


210321 _al 

210538_s_at 
211456 x at 
212057 ai 


GZMH 

BIRC3 

AF333388 

KIAA0182 


2999 
330 
23199 


1 16831 


nmjk>H85, 

NMJD011C5 
XM 050495 


qranzym© H (calhepsin G*Iiko 2 protein h*CCPXJ 

bacutoviral IAP repeat-contawtaQ 3 _____ 

<iAA0182OfOtein — 


212070_at 
2121 65 x at 

212229 s at 


MT2A, 
FBX021 


9289 
4502 

23014 


604110 
156360 


NM 005953 
NMJJ1SQQ2. 
NM 015002 


G protenvcouplod receptor 56 


212336 at 


fim_-I 


2036 


602879 


NMJ312156, 
NM 012150 


hvnathftUcaJ protein MGC21 416 _ 


212341_at 
212349 at 


MGC21416 


^B6451 
23509 


607491 


NM- 179M4. 
MM 015352 
MMJ0153S2' 


protein CHucosyUrensforaso 1 


212059 x at 

213201 s at 
213385 at 
213470 3 at 
213738 s at 
213757 at 
214617 at 


TMNT1 

CHN2 

HNRPH1 

ATP5A1 

EIP5A 

PRF1 


449^ 

7138 

3187 
49_ 
1984 
5551 


156351 

191041 
602857 
601035 
164360 
600187 
170280 


NM, 175617 
NM.0032B3, 
NM_O03263, 
XM 352926 
NM 004067 
NM 005520 
NM 00^M£ 
MM 001970 


motaHothionein IE (functional) ____»_-____ • 

troponin T1. skeletal, slow 

heterogeneous nuclear nbcnucteoprotein HI (H) __ __, _ _ 

ATP synthase. H+ transporting, mitochondrial F1 complex, alDha subunit. isoform 1. cardiac muscie 

perform 1 (pore tormina protein) — 


214924 s at 
215693 x at 
215780 s at 
216336 x at 
217727 x at 
217759 at 

217875 s at 


OIP106 
DDX27 
Hs 382039 
AL031602 
VPS35 

TMEPAI 


22906 
55661 

65737 
54765 

66937 


608112 

_gggga 

606564 


NM_P14.965 
NM 01,7,8,95 

NM 016206 

NMJ&20182. 
NM 020162, 
NmIi99169 4 
NM_1S9170 


OGT<Q»Glc-NAc transfgrasoHnteractmg protein 106 KOa 

DEAD fAsD-GI-wAla.Asp) box polypeptide 27 _ -__ 

tripartite motif-containing 44 ___________«__________-___- — 


217917 s at 


0NCL2A 


63658 


607167 


NMJM4183. 
NM_0141B3. 
NML177953 


dynotn cytoplasmic, light polypeptide 2A _ 


217933 s at 
21 8094 s at 


LAP3 
Q20orf35 


51056 
55861 


170250 


NM^JSeQZ 
NM 018473, 
NM_016478 


leucine ammopeptidase 3 . — ■ 

chromosome 20 open reading frame 35 , — „ _ 


218237.5 at 
218242 s at 


SLC3BA1 

pct-es 


81539 

sim 




£M 030674 
NMJ016028. 
NM_016028 


sotute carrier family 38. member 1 

CGI-85 protein 


218325 & at 

218345 at 

218346 s_at 
218704 at 


_^!_1 

FLJ20315 


11083 

55365 
27244 


604140 
606103 


NM_022105. 
NM_022105. 
NMJJ8D79G 
NM 018487 
NM 014454 

KIM 


death associated transcription factor 1 . 

hepatoceHutar carcinoma-associated antigen 112 _ „__, _____ 


218802 at 
218898 at 
218943 s at 

218963 s et 
219956 at 


FLJ20647 
CT120 

bk_i 

KRT23 
GAL NTS 


55013 
79850 
23586 

25984 
11226 


606194 
605148 


NM 0,17018 
NM 024702 _ 
NM 014314 
NM_01S5l&. 
NM_015515 
NM 007210 


hypoineUcal protein FLJ20647 — 

membrane protein expressed in eprthetiaMike lung adenocarcinoma _ — 

DEAD/H (Asp-Glu-Ala-Asp/His) box polypeptide ~ 

keratin 23 (rustone deacelylese inducible) __ — ! "*T" 


220658 s at 
220951 s at 


ARNTL2 
ACF 


56938 
2S974 




NM 020183 
NMJS 14576, 
NM 014976. 
NM 136932 




221516 s^at 
221653 x at 


FW?0?32 
APOL2 


S4471 
23780 


„607252 


NM_030882, 
NM_030882 


[hypothetical protein FLJ20232 


221920 s at 
222244 s at 


MSCP 

IFLJ20618 


... .flfttt 




NMJ&16612, 
NM_016812 


mitochondnal *ftlm& earner orotem 

hvpothelical orotem FU20618 I 



Table 5. Genes used for the classification of MSS vs MSI tumors 



Name                                               Symbol      Unigene      MSS     MSI
hepatocellular carcinoma-associated antigen 112    HCA112      Hs.12126     1261    653
metastasis-associated 1-like 1                     MTA1L1      Hs.173043    45      91
chemokine (C-X-C motif) ligand 10                  CXCL10      Hs.2248      104     274
heterogeneous nuclear ribonucleoprotein L          HNRPL       Hs.2730      194     630
hypothetical protein FLJ20618                      FLJ20618    Hs.52184     776     388
splicing factor, arginine/serine-rich 6            SFRS6       Hs.6891      74      446
protein kinase C binding protein 1                 PRKCBP1     Hs.75871     294     168

Table 6. Performance of the classifier 



        Training set                   Test set
        Errors in crossvalidation      Test errors
MSI     2.8% (n=25, range 0-6)         1.4% (n=10, range 0-4)
MSS     0.70% (n=30, range 0-3)        0.52% (n=29, range 0-2)
All     1.7% (n=55, range 1-7)         1.9% (n=39, range 0-5)



Table 7. 









Positive for MSS    True = (0.9948*29) = 28.8492     False = (0.138*10) = 1.38
Negative for MSS    False = (0.0052*29) = 0.1508     True = (0.962*10) = 9.62

Sensitivity                  28.9507/29 = 99.5%
Specificity                  9.62/10 = 96.2%
Positive predictive value    28.8492/30.2292 = 95.4%
Negative predictive value    9.62/9.7708 = 98.5%



* Based on a prevalence for MSS of 85% 
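
The four quantities in Table 7 follow the standard definitions for a two-class test with MSS treated as the positive class. The sketch below shows those definitions only; the function name and the counts are hypothetical and are not the study's numbers.

```python
def diagnostic_summary(tp, fn, fp, tn):
    """Return sensitivity, specificity and predictive values from confusion counts.

    tp: positives called positive    fn: positives called negative
    fp: negatives called positive    tn: negatives called negative
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),   # positive predictive value
        "npv": tn / (tn + fn),   # negative predictive value
    }


# Illustrative counts only.
summary = diagnostic_summary(tp=90, fn=10, fp=5, tn=95)
```

Sensitivity and specificity depend only on the classifier, whereas the predictive values also depend on how common each class is in the population being tested.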



